Machine Learning - Assignment

April, 2022

"I worked and submitted alone"

Daniela Nogueira

Libraries

1. Data Inspection and Visualization

The challenge is based on a regression problem that involves predicting energy efficiency. More specifically, I will perform an analysis using various building shapes, each with its own set of attributes, and predict the building's heating load. The buildings differ in terms of glazing area, glazing area distribution, orientation, and other characteristics included in the dataset.

The dataset includes 9 attributes denoted by X0, X1,..., X8, as well as an outcome variable Y that must be predicted. The following are the definitions of the ten variables:

Definitions

"The ‘thermal envelope’ of a building is the union of those structures that separate the conditioned part of the building (subject to being heated and/or cooled) from the outside (including neighboring buildings) or from other parts of the building that are not conditioned."

General information

Generally, there are no clear patterns we can infer from the functions above; we use them to get a general sense of the data. Our dataset has 768 entries/rows and 10 columns and, except for variable X0, all variables are floating-point numbers. However, one of the columns, specifically the variable X3, contains only 728 non-null values, indicating that missing data exists. In addition, we can see a discrepancy between the median and the mean in several columns, which suggests some skewness: the data is not distributed perfectly symmetrically, as the mean and median would otherwise be equal. The standard deviation, which represents the dispersion of a set of data values around their mean, is another interesting metric to consider here. When a variable's standard deviation is low, it means that data points are close to the mean, and vice versa. When there is missing data, as in the case of variable X3, this value can assist us in deciding how to handle the problem.
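The inspection described above can be sketched as follows. The DataFrame here is a small synthetic stand-in (the real dataset has 768 rows and columns X0..X8 plus the target Y); the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the energy-efficiency data; values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "X1": rng.normal(0.76, 0.10, 100),
    "X3": rng.normal(300.0, 40.0, 100),
    "Y": rng.normal(22.0, 10.0, 100),
})
df.loc[df.sample(5, random_state=0).index, "X3"] = np.nan  # simulate missing values

df.info()                 # dtypes and non-null counts reveal missing data
print(df.describe())      # compare mean vs. 50% (median) to spot skewness
print(df.isnull().sum())  # exact number of missing entries per column
```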

Outliers and Variables Distribution

We can see that the variables X1 and X3 have a few outliers. However, because these outliers are not dramatically distant from the other points, they do not seem to be influential points ("extreme outliers") that would significantly affect the slope of the regression line, so I do not believe it is necessary to exclude them from the data. They appear to be relevant to our analysis and results. Additionally, it should be noted that, as seen before, none of our numerical variables follows a normal distribution; the variables X4 and X7 are the closest to this scenario.

Correlations and Visualizations

The corr() function uses Pearson correlation as its default method, in which the coefficients indicate how strong the linear relationship between two variables, x and y, is. A linear relationship describes a straight-line relationship between two variables. Under this assumption, the two variables have a direct connection, which means that if the value of x changes, y must also change. For example, if the volume of a material is doubled, its weight will also double; this is a perfect linear relationship.

The correlation coefficient is given in the range -1.0 to 1.0. A positive association is indicated by a correlation coefficient greater than zero, which means that if the value of one variable rises, the value of the other tends to increase as well. On a scatterplot, positive relationships produce an upward slope. A negative correlation is shown by a correlation coefficient smaller than zero, which means that as the value of one variable rises, the value of the other variable tends to fall. Negative relationships produce a downward slope. In other words, a positive correlation indicates that both variables move in the same direction, whereas a negative correlation indicates that they move in opposite directions. Finally, a value of zero shows that the two variables x and y have no linear relationship: changes in the output are not proportional to changes in the input and, in this case, when one variable increases, there is no tendency in the other variable to either increase or decrease.

There is a relationship when the value is between 0 and +1/-1, but the points do not all fall on a line. The strength of the association grows as r approaches -1 or 1, and the data points tend to fall closer to a straight line.
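The three cases described above can be illustrated with a toy example; the variable names are invented purely for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "pos": 2 * x + rng.normal(scale=0.5, size=200),   # strong positive relation
    "neg": -3 * x + rng.normal(scale=0.5, size=200),  # strong negative relation
    "none": rng.normal(size=200),                     # no linear relation
})

corr = df.corr()  # Pearson is the default method of DataFrame.corr()
print(corr.loc["x"])
```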

The predictors that appear to have a statistically significant relationship to the response are: X5, X1, X3, X7, X2 and X4. However, some variables present stronger correlations with the y variable than others, with correlation coefficients closer to 1 or -1. We also observe that some of our predictors have substantial associations with one another, as seen, for example, in the positive correlation values of 0.57 for variables X1 and X5 and for variables X4 and X2. Consequently, an effect in one of them is absorbed by the other.

The scatterplots below illustrate: the positive or negative linear relationships between the variable y and the predictors most strongly correlated with it; the regression coefficients, where the further a coefficient is from zero, the greater the impact of that predictor on y; the distributions and density of the variables through box and violin plots; and, finally, the R-squared, which represents the proportion of the variance of the dependent variable that is explained by the independent variable. As a result, the higher the R-squared, the better. Additionally, the R-squared provides a measure of the strength of the relationship between the model and the response variable.

2. Data Preprocessing

Discarding non-useful variables

Handling Categorical data

Treatment of categorical variables is one of the data preprocessing phases, and it is a crucial step because most machine learning algorithms cannot handle categorical variables unless they are converted to numerical values, and even those that can tend to perform better when all variables are numeric and preferably of the same dtype. Ordinal Encoding and One-Hot Encoding are two of the most used methods for dealing with categorical variables. In this example, we will use the One-Hot Encoding approach, which can be done in two ways: by using get_dummies in pandas, or by using OneHotEncoder from sklearn. Ordinal Encoding maps each unique category value to a specific numerical value depending on its order or rank, converting ordinal categories into ordered numerical values. In our case, we do not want to impose any order on our categorical variable, and we do not want our model to account for one, which would result in poor performance.
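Both encoding routes mentioned above can be sketched as follows; the column X0 and its category labels are illustrative stand-ins for the assignment's categorical attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column standing in for X0.
df = pd.DataFrame({"X0": ["A", "B", "A", "C"]})

# Option 1: pandas
dummies = pd.get_dummies(df["X0"], prefix="X0")

# Option 2: scikit-learn (fit on the training data, then reuse on test data)
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(df[["X0"]]).toarray()

print(dummies.columns.tolist())  # one new 0/1 column per category
```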

Missing data

When dealing with missing data, we have 3 options: impute the missing values, delete the rows that contain missing values, or remove the affected variable/column altogether.

When the fraction of missing data is small, the imputation method can be quite useful. If the portion of missing data is too high, the imputation method will result in data that lacks natural variation. On the other hand, it is important to ensure that the value we use to fill in the missing entries, such as the column's median, is sensible and appropriate for the variable we're working with. As a result, it is important to try to figure out why the data is missing and to fully understand the variable we are dealing with. If the percentage of missing data is small, the choice to delete rows with missing data also makes sense; otherwise, we may wind up with an insufficient number of observations to produce a reliable analysis. Finally, removing the variable/column from our analysis appears reasonable only if we are faced with a high percentage of non-existent values, such as 60%, because the observations we have may be questionable and not representative of the population, and choosing any of the other options would result in one of the less favourable scenarios mentioned above.

Nearest Neighbor Imputation with KNNImputer

Using a model to predict missing values is an effective approach to data imputation. For each feature with missing values, a model is built using the values of (potentially) all other input features. In our case, I'll use Nearest Neighbor Imputation with the KNNImputer technique, in which each sample's missing values are imputed using the mean value from the n_neighbors nearest neighbours found in the training set.
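A minimal sketch of KNNImputer on a tiny made-up matrix with one missing entry; the values and n_neighbors=2 are chosen only to make the mechanics visible.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# The missing value is replaced by the mean of the first feature over the
# row's 2 nearest neighbours (rows [3, 4] and [8, 8]), i.e. (3 + 8) / 2 = 5.5.
print(X_filled)
```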

Data splitting

In order to start training models to predict heating load, we need to split the data into an X array that contains the features to train on, and a y array with the target variable. This division will be based on the stratified sampling approach, a method of obtaining a representative sample from a population that has been separated into approximately comparable subpopulations. Stratified sampling is used by researchers to ensure that specified subgroups are represented in their sample, so it helps retain the whole diversity of the population in the sample. When members of subpopulations are homogeneous in comparison to the total population, stratified sampling can give more accurate estimates than simple random sampling, which increases the statistical power of a study. For our division, we will use the X5 variable because it has the strongest correlation with the Y variable.

We can observe that the stratified test set's distribution of the variable X5 is much closer to that of the original dataset than the random test set's distribution.
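One way to sketch such a stratified split: since X5 is continuous, it is first binned into strata and the split then preserves the bin proportions. The synthetic data, the bin count of 4, and the 80/20 split are illustrative assumptions, not the assignment's exact choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset.
rng = np.random.default_rng(2)
df = pd.DataFrame({"X5": rng.uniform(0.0, 0.4, 500),
                   "Y": rng.normal(22.0, 10.0, 500)})

# Bin the continuous X5 into strata, then split preserving bin proportions.
df["X5_bin"] = pd.cut(df["X5"], bins=4, labels=False)
train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["X5_bin"])
train, test = train.drop(columns="X5_bin"), test.drop(columns="X5_bin")
```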

Scaling data

It would be difficult to feed values into a model that have widely varying ranges. The model might be able to adapt to such a wide range of data automatically, but it would make learning more challenging. As a result, feature normalisation (Min-Max scaling) or standardisation (Z-score normalisation) is required. The first rescales the data from the original range so that all values fall within the new range of 0 to 1. This scaling is strongly affected by outliers and can be useful in algorithms, such as K-Nearest Neighbors and Neural Networks, that do not assume any distribution of the data. The second, standardisation, rescales the distribution of values so that the mean of the observed values is 0 and the standard deviation is 1. Because there is no predefined range for the transformed features, this scaling is less affected by outliers, and it is more effective when data follows a normal distribution. This last part does not have to be true, but standardisation is in general advantageous when the data has variables of different scales and the technique being used, such as linear regression, logistic regression, or linear discriminant analysis, makes assumptions about the data having a Gaussian distribution. Taking everything into account, and considering the models I will be using to train our data, I will use standardisation. However, I believe that normalisation would also do a great job.

Standardisation was only applied to the numerical columns, not to the One-Hot Encoded features. Standardising the One-Hot Encoded features would mean assigning a distribution to categorical features, and we do not want to do that. The One-Hot Encoded features are already in the range of 0 to 1, so normalisation would have no effect on their values if I chose to normalise the data. It is also worth mentioning that the statistics used to standardise the test data are computed from the training data. Even for something as simple as normalisation or standardisation, we should never use any quantity computed on the test data.
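The fit-on-train, transform-on-test rule can be sketched as follows; the random matrices stand in for the numerical feature columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative numerical features.
rng = np.random.default_rng(3)
X_train = rng.normal(10.0, 3.0, size=(80, 2))
X_test = rng.normal(10.0, 3.0, size=(20, 2))

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit: learn mean/std on train only
X_test_s = scaler.transform(X_test)        # transform only: no test leakage
```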

Feature selection

I'm going to use the Random Forest Importance technique for feature selection. Random Forests are a type of bagging algorithm that aggregates a set of decision trees. Random forests' tree-based strategies naturally rank features by how well they increase node purity, or, in other words, how well they reduce impurity over all trees. The nodes with the largest drop in impurity are found at the beginning of the trees, while the nodes with the smallest decrease in impurity are found at the end. We can produce a subset of the most important features by pruning trees below a specific node.
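A sketch of the ranking step on synthetic data where the first feature clearly dominates the target; the data and feature count are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_      # impurity-based, sums to 1
ranking = np.argsort(importances)[::-1]    # most to least important feature
```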

Fine-Tune Your Model

Validation set

To avoid information leaks, we shouldn't tweak the model based on the test set, so we also set aside a validation set: we hold out a portion of the training set, train on the remaining data, and then evaluate on the validation set. We evaluate our model on the test set only once. Additionally, the validation set is used to optimise the model's hyperparameters.

Machine Learning Models

Regression Evaluation Metrics

Mean Squared Error (MSE)

Mean of the squared errors: MSE = 1/N * ∑ (y_i - yhat_i)²

Where y_i is the i’th expected value in the dataset and yhat_i is the i’th predicted value. The difference between these two values is squared, which has the effect of removing the sign, resulting in a positive error value.

Large errors are also inflated as a result of the squaring. The wider the discrepancy between the predicted and true values, the larger the squared positive error. When MSE is employed as a loss function, this has the effect of "punishing" models more for higher errors. When employed as a metric, it also has the effect of "punishing" models by raising the average error score.

The units of the MSE are the squared units of the target variable.

Root Mean Squared Error (RMSE)

The square root of the mean of the squared errors: RMSE = sqrt(1/N * ∑ (y_i - yhat_i)²)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value, and sqrt() is the square root function.

The RMSE, or Root Mean Squared Error, is a variant of the mean squared error. However, because the square root of the error is taken, the RMSE units are the same as the original units of the target value. Like the MSE, it is heavily influenced by large errors, and it is better than the MAE at expressing performance when large error values matter. As a result, when lower residual values are preferred, RMSE is more useful.

Mean Absolute Error (MAE)

Mean of the absolute value of the errors: MAE = 1/N * ∑ abs(y_i - yhat_i)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value and abs() is the absolute function.

MAE is a common statistic because, like RMSE, the error score's units match the units of the target value. The changes in MAE, unlike the RMSE, are linear and thus intuitive.

MSE and RMSE punish greater errors more harshly than smaller ones, inflating the mean error score. The MAE does not give distinct types of errors more or less weight; instead, the scores grow linearly as the amount of error increases.

R-squared (R2)

The R-squared value is always between 0 and 100%. R-squared is a statistical measure of how close the data are to the fitted regression line, indicating the percentage of the variance in the dependent variable that the independent variables explain collectively.  R-squared cannot determine whether the coefficient estimates and predictions are biased, so we must analyse the residual plots. In some fields, it's completely normal to have low R-squared values. R-squared values of less than 50% are normal in any science that seeks to predict human behaviour, such as psychology.
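The four metrics above can be computed with scikit-learn; the true and predicted values here are made up, just to show the calls.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # squared units of the target
rmse = np.sqrt(mse)                        # back to the target's units
mae = mean_absolute_error(y_true, y_pred)  # grows linearly with error size
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
print(mse, rmse, mae, r2)
```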

Cross-Validation

Cross-validation is a resampling technique for evaluating machine learning models, especially on a small sample of data. The procedure includes only one parameter, k, which specifies the number of groups into which a given data sample should be divided; as a result, it is frequently referred to as k-fold cross-validation. For each group i, we train a model on the remaining k-1 partitions and evaluate it on partition i. The final score is the average of the k scores obtained. This method is helpful when the performance of a model shows significant variance depending on the train-test split. Importantly, each observation in the data sample is assigned to a distinct group and remains there throughout the process. This means that each sample has the chance to be used in the hold-out set once and to train the model k-1 times. This strategy provides a more accurate estimate of the algorithm's performance on new data because the algorithm is trained and assessed several times on different data. The size of each test partition must be large enough to constitute a good sample of the problem, whilst allowing enough repetitions of the train-test evaluation to provide a fair estimate of the algorithm's performance on unseen data. K values of 3, 5, and 10 are common for modest-sized datasets with thousands or tens of thousands of observations.

Note: The cross_val_score() function from scikit-learn allows us to evaluate a model using the cross validation scheme and returns a list of the scores for each model trained on each fold.
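A sketch of cross_val_score with 5 folds; the synthetic regression data stands in for the preprocessed features and the Heating Load target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# scikit-learn maximises scores, so error metrics are returned negated.
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
rmse_cv = -scores.mean()   # average RMSE over the 5 folds
```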

1. Linear regression

Let's start with linear regression, also known as ordinary least squares (OLS) or linear least squares. It is the simplest algorithm in machine learning, but despite its simplicity it often works well for different types of data. In machine learning problems, if we have a choice between two models, one sophisticated and the other much simpler, we should choose the simpler one if it represents the data as well as the complex model.

One of the most common problems in model training is called overfitting, and it occurs when the model fits the training data extremely well, learning the details, patterns, and noise in the training data, which negatively affects the model's performance on new, previously unseen data. In other words, the model fits the training data very closely but performs poorly on the unseen/test data. The goal of machine learning is generalisation, the ability to make good predictions on new data.

As can be seen above (heatmap), there is a substantial correlation between the independent variables, which might give rise to an issue known as multicollinearity, which favours the overfitting scenario. This correlation is a concern because independent variables should be independent. Furthermore, when independent variables are correlated, shifts in one variable are linked to shifts in the other. When the relationship is significant, it is more difficult to adjust one variable without changing another, and it becomes challenging for the model to estimate the relationship between each independent variable and the dependent variable individually.

A regression coefficient represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other variables constant.

Multicollinearity is responsible for the following two sorts of issues: the coefficient estimates become unstable and very sensitive to small changes in the model, and the statistical power of the hypothesis tests is reduced, so p-values may fail to identify statistically significant independent variables.

OLS has several weaknesses, including sensitivity to outliers, multicollinearity, and heteroscedasticity.

Heteroscedasticity refers to situations where the residuals of a regression model do not have constant variance. If the scatter of the residuals is unequal, the population used in the regression has unequal variance, and the conclusions of the study might be dubious.

Heteroscedasticity is responsible for the following two sorts of issues: the OLS estimates are no longer the minimum-variance (most efficient) estimates, and the estimated standard errors become biased, making significance tests and confidence intervals unreliable.

Note: The p-values here test the null hypothesis that a regression coefficient (beta) is zero, so there is no relationship between it and the target.

Assumptions of linear regression: a linear relationship between the predictors and the target, independence of the errors, homoscedasticity (constant variance of the errors), normality of the error terms, and little or no multicollinearity among the predictors.

Evaluation

The Residuals vs Fitted plot shows if residuals have non-linear patterns. There could be a non-linear relationship between predictor variables and an outcome variable and the pattern could show up in this plot if the model does not capture the non-linear relationship. In this case, the model seems to be close to linear. In general, we can find equally spread residuals around the horizontal line which is a good indication that we do not have significant non-linear relationships.

The Normal Q-Q plot shows that the residuals are normally distributed and they follow the straight line well.

2. Ridge Regression

Overfitting and multicollinearity are two issues that may arise as a result of the data training process. Regularization is one of the techniques developed to help mitigate these events; it consists of adding a penalty term to the objective function. Regularization reduces the model's variance without considerably increasing its bias, resulting in more useful coefficient estimates.

Note: Multiple Linear Regression function: y = β0 + β1X1 + ... + βiXi + ε, where "y" is the response, "β0" is the intercept, the other betas are the regression coefficients, "X" are the independent variables, and "ε" is the model error. The objective function, also used to find the best coefficients, is: Min ∑ε² = Min ∑(y - (β0 + β1X1 + ... + βiXi))². In the linear regression objective function we try to minimize the sum of squares of errors.

Ridge Regression makes use of L2 regularisation, which tries to minimize the objective function by adding a constraint term (λ, lambda) on the sum of the squares of the coefficients. In other words, it imposes a penalty on the size of the coefficients. The intercept term is not regularised; the constraint only applies to the sum of squares of the X's regression coefficients. Ridge Regression has the same assumptions as linear regression, except that it does not require normality of the error terms. Ridge reduces the values of the coefficients but does not set them to zero, implying that no feature selection takes place.

The objective function is: Min (∑ε² + λ∑β²) = Min ∑(y - (β0 + β1X1 + ... + βiXi))² + λ∑β²
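The shrinkage effect can be sketched on synthetic data; in scikit-learn, the alpha parameter plays the role of λ in the objective above, and the value 10.0 is illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L2 penalty shrinks the coefficients towards zero without zeroing them.
shrunk = np.abs(ridge.coef_).sum() < np.abs(ols.coef_).sum()
print(shrunk)
```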

Tuning Ridge Hyperparameters

3. Lasso Regression

Lasso stands for Least Absolute Shrinkage and Selection Operator. It makes use of the L1 regularisation technique in the objective function. Lasso penalises the size of the regression coefficients in the same way as Ridge Regression does, but it differs from ridge regression in that it uses absolute values in the penalty function rather than squares, resulting in some parameter estimates being exactly zero, which aids feature selection. Furthermore, by identifying a simpler model, it is capable of minimising the variability and enhancing the accuracy of linear regression models, which prevents the model from becoming overfit. When the independent variables are extremely collinear, Lasso regression selects only one variable and shrinks the others to zero.

The objective function is: Min (∑ε² + λ∑|β|) = Min ∑(y - (β0 + β1X1 + ... + βiXi))² + λ∑|β|
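The selection effect can be sketched on synthetic data where only 3 of 10 features actually drive y, so Lasso should zero out most of the remaining coefficients; alpha=1.0 is an illustrative penalty, not a tuned value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))  # features kept by the L1 penalty
print(n_selected)
```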

4. Elastic net

It is a combination of both L1 and L2 regularization.

The objective function in the case of Elastic Net Regression is: Min (∑ε² + λ1∑β² + λ2∑|β|) = Min ∑(y - (β0 + β1X1 + ... + βiXi))² + λ1∑β² + λ2∑|β|

Like ridge and lasso regression, it does not assume normality.

Model Comparison

I decided to use the Root Mean Squared Error (RMSE) metric with Cross-Validation to compare the models and pick the best one for the dataset in question. This decision was made because, as previously stated, the Cross-Validation approach trains the model over numerous train-test splits, making it more likely to perform well on unseen data. Furthermore, the fact that we split the dataset into multiple folds and train the algorithm on different folds prevents our model from overfitting the training dataset. Another advantage of k-fold cross-validation is that each data point is tested exactly once and is used in training k-1 times. Finally, as k increases, each model is trained on a larger share of the data, which reduces the bias of the final estimate. As a result, the model achieves generalisation capabilities, which is an indication of a robust method.

Why RMSE?

Since we want to minimize the errors, and RMSE is a negatively-oriented score, lower values are better.

According to our loss function, RMSE Cross-Validation, Lasso Regression is the model that best fits the data and also has the lowest RMSE value, 4.645083. Additionally, this model performs better on the other measures, presenting the highest R-squared, 0.778435, indicating that about 78% of the variance in the Y variable is explained by the independent variables collectively, and the lowest MAE_CV, meaning that, on average, the predicted Heating Load is off by just 3.577086 compared to the true values.

5. Neural Networks

Before training we need to transform our data into NumPy arrays; our network will expect samples as vectors (1D arrays) of floating-point numbers.

Baseline model

The network ends with a single unit and no activation (it will be a linear layer). This is a typical setup for scalar regression (a regression where we are trying to predict a single continuous value). Applying an activation function would constrain the range the output can take; for instance, if we applied a sigmoid activation function to the last layer, the network could only learn to predict values between 0 and 1.
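A sketch of such a baseline network in Keras; the layer sizes, optimizer, and the number of input features (n_features) are illustrative assumptions, not the exact architecture used.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 11  # illustrative: number of columns after preprocessing

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # single unit, no activation: a linear output layer
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
```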

MAE evolution per epoch

It may be a little difficult to read this plot, due to scaling issues and relatively high variance. So we are going to:

According to this plot, the validation MAE stops improving significantly after 100 epochs. Past that point, we start overfitting.

Add momentum and SELU as activation function

With SGD, momentum addresses two issues: convergence speed and local minima (stuck points). Gradient descent searches for the global minimum of the loss function (derivative = 0), which is the set of weight values that produces the lowest loss feasible. During this process, optimization may become trapped at a local minimum rather than progressing to the global minimum. With momentum, the parameter W (in dot(W, input) + b) is updated based on both the current gradient (derivative) value and the previous update.
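Configuring momentum in Keras can be sketched as follows; the learning rate and the momentum value of 0.9 are common illustrative choices, not the tuned hyperparameters.

```python
from tensorflow import keras

# SGD with momentum: each update blends the current gradient with the
# previous update, which speeds convergence and helps escape flat regions.
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
```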

When compared to the results of our base model, the MAE and MSE values from this model, after the introduction of momentum and the change of the activation function from ReLU to SELU, show a significant improvement. SELU achieves self-normalisation of the network: the output of each layer tends to preserve a mean of 0 and a standard deviation of 1. It also helps with vanishing gradients, which occur when gradients are small or zero and the weights and biases of the initial layers are not updated effectively during training.

Add L1 regularization

In machine learning problems, if we have a choice between two models, one sophisticated and the other much simpler, we should choose the simpler one if it represents the data as well as the complex model. Overfitting is less common with simpler models. Parameters are "regular" if they fill a small interval close to zero. Large weight parameters will amplify the noise, and the network will adapt to that noise; a model with smaller parameters is more robust. Putting constraints on a network's complexity by forcing its weights to take more regular values is a frequent approach to avoid overfitting. This is known as weight regularisation, and it is accomplished by including a cost associated with large weights in the network's loss function.

L1 regularisation is more resilient to outliers than L2 regularisation: L2 regularization takes the square of the weights, so the cost of the outliers present in the data increases quadratically, whereas L1 regularization takes the absolute values of the weights, so the cost only increases linearly. As previously said, Lasso was the best-performing linear regression model; since that model uses L1 regularization, we will use it for our neural network model as well.
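Adding an L1 weight penalty to the dense layers can be sketched as follows; the layer sizes, input width, and the penalty factor 0.001 are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_features = 11          # illustrative
l1 = regularizers.l1(0.001)  # illustrative L1 penalty factor

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="selu", kernel_regularizer=l1),
    layers.Dense(64, activation="selu", kernel_regularizer=l1),
    layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              loss="mse", metrics=["mae"])
```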

When compared to the results of our last model, the MAE and MSE values from this model, after the introduction of L1 regularization, show a slight improvement.

Add Batch Normalization

Batch normalization is a type of layer that can adaptively normalise the data even as the mean and variance change over time during training. It works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training. The main effect of batch normalisation is that it helps with gradient propagation, much like residual connections, and thus allows for deeper networks. The BatchNormalization layer takes an axis argument, which specifies the feature axis that should be normalised. This argument defaults to -1, the last axis in the input tensor.

When compared to the results of our last model, the MAE and MSE values from this model, after the introduction of batch normalization, show a slight improvement.

Add Dropout

For many neural networks, dropout is one of the most successful and widely used regularisation strategies. During training, dropout involves randomly dropping out, or setting to zero, a number of the layer's output features. The dropout rate is the fraction of features that are zeroed out; it is normally in the range of 0.2 to 0.5. The idea behind it is that introducing noise in the output values of a layer can break up insignificant patterns that the network would start memorising if there were no noise.

Note: Batch normalization does depend on the statistics of the distribution. So, if you have a dropout before a batch normalization, batch normalization will have different results during training and validation.

With the addition of dropout, we see that the results were slightly worse than those of the previous model. So, I will tune this model without dropout.

Hyperparameter tuning

Callbacks

Evaluate only the 2 best models on the test set

Neural Networks

Lasso

We can observe that, when compared to Lasso, the Neural Network model fits the data substantially better, with significantly better MSE and MAE results. Neural networks act like a general-purpose program that identifies patterns and can generate output patterns that match them; because of the huge number of neurons involved, the complexity and quantity of connections replace the complexity of traditional applications. Before Lasso saturates, it selects a maximum of n variables, and Lasso is unable to do group selection: when there is a collection of variables with very high pairwise correlations, Lasso tends to pick only one variable from the group at random. This is one of Lasso's limitations, which may have affected its performance.

References